-
-
Notifications
You must be signed in to change notification settings - Fork 1.2k
Refactor Dataset.map to merge attrs instead of copying #11020
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Update the `keep_attrs` behavior in `Dataset.map()` and `DataTree.map()` to merge attributes from the original and function results using the `drop_conflicts` strategy, rather than unconditionally copying original attrs. When `keep_attrs=True`, matching attrs are kept and conflicting attrs are dropped. When `keep_attrs=False`, only attrs set by the function are retained. Add comprehensive tests for the new attr merging behavior.
Weighted operations internally propagate attrs from weights through computations like dot(). When keep_attrs=False is passed, users expect no attrs on the result, but attrs from weights were leaking through. Clear attrs explicitly in _implementation when keep_attrs is False. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]>
The `test-nightly` environment uses pandas nightly wheels from PyPI, which currently don't have win-64 builds available. This causes `pixi lock` to fail when solving for all platforms. RTD builds fail because they have no lock file cache (unlike GitHub Actions CI which caches pixi.lock). When RTD runs `pixi install -e doc`, pixi must generate the lock file from scratch, which fails on the unsolvable test-nightly/win-64 combination. This restriction can be removed once pandas nightly provides win-64 wheels again. Co-authored-by: Claude <[email protected]>
🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]>
|
I'll pipe in to say that this PR greatly reduces the number of test failures MetPy has with the latest xarray, so it's a 👍 from me for what that's worth. |
| if keep_attrs: | ||
| # Merge attrs from function result and original, dropping conflicts | ||
| from xarray.structure.merge import merge_attrs | ||
|
|
||
| for k, v in variables.items(): | ||
| v._copy_attrs_from(self.data_vars[k]) | ||
| v.attrs = merge_attrs( | ||
| [v.attrs, self.data_vars[k].attrs], "drop_conflicts" | ||
| ) | ||
| for k, v in coords.items(): | ||
| if k in self.coords: | ||
| v._copy_attrs_from(self.coords[k]) | ||
| else: | ||
| for v in variables.values(): | ||
| v.attrs = {} | ||
| for v in coords.values(): | ||
| v.attrs = {} | ||
| v.attrs = merge_attrs( | ||
| [v.attrs, self.coords[k].attrs], "drop_conflicts" | ||
| ) | ||
| # When keep_attrs=False, leave attrs as the function returned them |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The problem with interpreting keep_attrs=False as "leave the attrs as returned by the function" is that that means we don't have keep_attrs="drop" anymore.
I'd argue that keep_attrs=True should be closer to what you're proposing for keep_attrs=False, which I do think would be more intuitive.
So instead we may need to consider supporting keep_attrs with strategy names / a strategy function, like apply_ufunc does. That would still allow you to choose "drop_conflicts" if preferred (or maybe as the default? Not sure), while not changing behavior too drastically.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, the proposed code treats keep_attrs=False as "remove all the input attrs". but not "remove all the output attrs".
@keewis can you see a reasonable change to fix the immediate issue without adding a whole strategy to keep_attrs? I don't have a particularly strong view on this specific implementation, but it does seem reasonable / logical, and it does let us solve this immediate bug...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(zooming out — as I mentioned before, for me the best "blank-slate" implementation for keep_attrs is to mostly not have a the option at all, and folks can drop attrs if they want. though I agree with you that merging is case that neither approach handles well...)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the proposed code treats
keep_attrs=Falseas "remove all the input attrs". but not "remove all the output attrs".
keep_attrs as a boolean is always ambiguous. For instance in #10997 I arrived at the conclusion that keep_attrs=False should drop all attrs from the output.
It seems like it wouldn't be too hard to make this more configurable by just passing any string that is passed to keep_args along to merge_attrs like @keewis is suggesting. You would just map True to "override" and False to "drop" (I think that's what the behavior is now on main) and then people who want something else can use a specific string.
|
Is there anything this is waiting on? I'm trying to decide if I need to vendor our own implementation of |
|
@pydata/xarray any thoughts? reasons not to move forward? |
| if keep_attrs: | ||
| # Merge attrs from function result and original, dropping conflicts | ||
| from xarray.structure.merge import merge_attrs | ||
|
|
||
| for k, v in variables.items(): | ||
| v._copy_attrs_from(self.data_vars[k]) | ||
| v.attrs = merge_attrs( | ||
| [v.attrs, self.data_vars[k].attrs], "drop_conflicts" | ||
| ) | ||
| for k, v in coords.items(): | ||
| if k in self.coords: | ||
| v._copy_attrs_from(self.coords[k]) | ||
| else: | ||
| for v in variables.values(): | ||
| v.attrs = {} | ||
| for v in coords.values(): | ||
| v.attrs = {} | ||
| v.attrs = merge_attrs( | ||
| [v.attrs, self.coords[k].attrs], "drop_conflicts" | ||
| ) | ||
| # When keep_attrs=False, leave attrs as the function returned them |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the proposed code treats
keep_attrs=Falseas "remove all the input attrs". but not "remove all the output attrs".
keep_attrs as a boolean is always ambiguous. For instance in #10997 I arrived at the conclusion that keep_attrs=False should drop all attrs from the output.
It seems like it wouldn't be too hard to make this more configurable by just passing any string that is passed to keep_args along to merge_attrs like @keewis is suggesting. You would just map True to "override" and False to "drop" (I think that's what the behavior is now on main) and then people who want something else can use a specific string.
Address PR feedback to also clear coordinate attrs (not just data_vars attrs) when keep_attrs=False in both DataArrayWeighted and DatasetWeighted. Added test to verify coord attrs are cleared for both DataArray and Dataset. Co-authored-by: Claude <[email protected]>
Use proper type annotation with DataArray | Dataset union type to avoid incompatible assignment error. Co-authored-by: Claude <[email protected]>
fixes #11019
I'm not sure this is the best solution for the specific case, but it's at least consistent with our existing behavior (I think? not super confident), and is quite reasonable behavior.
other options:
keep_attrs=Falsekeep_attrs=TrueChanges:
Dataset.map()/DataTree.map(): Whenkeep_attrs=True, merge attrs from function result and original usingdrop_conflicts(matching attrs kept, conflicting attrs dropped). Whenkeep_attrs=False, leave attrs as the function returned them.Weighted operations: Explicitly clear attrs when
keep_attrs=False, since internal computations (likedot) propagate attrs from weights.whats-new.rstapi.rst